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Abstract: Recent advances of information technology in biomedical sci- 
ences and other applied areas have created numerous large diverse data 
sets with a high dimensional feature space, which provide us a tremendous 
amount of information and new opportunities for improving the quality 
of human life. Meanwhile, great challenges are also created driven by the 
continuous arrival of new data that requires researchers to convert these 
raw data into scientific knowledge in order to benefit from it. Association 
studies of complex diseases using SNP data have become more and more 
popular in biomedical research in recent years. In this paper, we present a 
review of recent statistical advances and challenges for analyzing correlated 
high dimensional SNP data in genomic association studies for complex dis- 
eases. The review includes both general feature reduction approaches for 
high dimensional correlated data and more specific approaches for SNPs 
data, which include unsupervised haplotype mapping, tag SNP selection, 
and supervised SNPs selection using statistical testing/scoring, statistical 
modeling and machine learning methods with an emphasis on how to iden- 
tify interacting loci. 
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1. Introduction 

Correlating genetic variations in DNA sequences with phenotypic differences has 
been one of the grand challenges in biomedical research. Substantial efforts have 
been made to obtain all common genetic variations in humans, including single 
nucleotide polymorphisms (SNPs), deletions and insertions [13]. SNPs are sin- 
gle base pair positions in genomic DNA at which different sequence alternatives 
(alleles) exist in normal individuals in some population(s), wherein the least 
frequent allele has an abundance of 1% or greater [13]. In practice, the term 
"SNP" is used more loosely. Restricting the attention to common SNPs with 
minor allele frequency bigger than a certain cutoff, e.g. 1% will help to filter 
out some "recent" mutations. SNPs are believed to alter the risk for develop- 
ing particular diseases. It is, however, very unlikely that individual SNPs play 
an important role in the development of complex diseases. Instead, high-order 
interactions of SNPs are supposed to explain the differences between low and 
high risk population groups. 

The HapMap Project has collected genotypes of millions of SNPs from popu- 
lations with ancestry from Africa, Asia and Europe and makes this information 
freely available in the public domain [93-95]. To find evidence of association in 
this huge data set is a grand challenge now. Therefore, there is a great need, con- 
ceptually as well as computationally, to develop advanced robust algorithms and 
analytical methods for characterizing genetic variations that are non-redundant 
and identify the target SNPs that are most likely to affect the phenotypes and 
ultimately contribute to disease development. 

Exploiting information redundancy due to associations between SNP markers 
potentially reduces the efforts in terms of time and cost for genetic association 
studies [75]. However, the efficacy of searching for an optimal set of SNPs has 
not been as successful as expected in theory. One primary cause is the high 
dimensionality with highly correlated fcatures/SNPs that can hinder the power 
of the identification of small to moderate genetic effects in complex diseases. 
The need to incorporate covariates of other environmental risk factors as effect 
modifiers or confounders further worsens the "curse of dimensionality" problem 
in mapping genes for complex diseases [16]. Therefore, feature selection for mas- 
sive genomic data in high dimensions has become one of the main tasks to be 
tackled with statistical and computational efforts in the past decade. 
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2. Feature selection methods for high dimensional problems 

The computational and statistical methods that address the "curse of dimen- 
sionality" problem in genomic research can be grouped into three categories: 
filtering, wrapper, and embedded methods. Filtering methods select feature sub- 
sets independently from the learning classifiers and do not incorporate learning. 
Therefore, filtering methods are fast [10; 60; 69; 109]. A weakness of filtering 
methods is that they only consider the individual features in isolation and ig- 
nore the possible interaction among them. Yet, the combination of these features 
may have a combined effect that does not necessarily follow from the individual 
performances of features in the group [73] . One of the consequences of filtering 
methods is that we may end up with many highly correlated features/SNPs 
with highly redundant information that worsens the classification and predic- 
tion performance. If there is a limit on the number of features to be chosen, 
then we may not be able to include all the informative ones. 

To address this problem in filtering methods, wrapper methods wrap around 
a particular learning algorithm that can assess the selected feature subsets in 
terms of the estimated classification errors and then build the final classifier [44] . 
Wrapper methods use a learning machine to measure the quality of subsets of 
features. One of the well-known wrapper methods for feature selection is Sup- 
port Vector Machine Recursive Feature Elimination, which refines the optimum 
feature set by using a Support Vector Machine, [33]. The idea of SVMRFE is 
that the orientation of the separating hyper-plane found by the SVM can be 
used to select informative features: if the plane is orthogonal to a particular 
feature dimension, then that feature is informative, and vice versa. SVMRFE 
uses the weights of a SVM classifier to produce a feature ranking, and then 
eliminates the feature with smallest weight magnitude recursively. 

Wrapper methods can be used with arbitrary classifiers and can notably re- 
duce the number of features and significantly improve the classification accuracy 
[63; 79]. However, wrapper methods have the drawback that they do not incor- 
porate knowledge about the specific structure of the classification or regression 
function [52]. Moreover, they are more computationally expensive since they 
need to evaluate a cross-validation scheme at each iteration. 

With much better computational efficiency and similar performance to wrap- 
per methods, a relatively new class of approaches for feature selection called 
"embedded methods" has become available in the literature. Lai et al. [52] pro- 
vide the detailed mathematical formulations of embedded methods. Embedded 
methods process feature selection simultaneously with the learning classifier and 
the feature selection can not be separated from the learning. For example, We- 
ston et al. [107] measure the importance of a feature using a bound that is valid 
for Support Vector Machines only, thus it is not possible to use this method 
with, for example, decision trees. 

Therefore the structure of the class of functions under consideration plays a 
crucial role. For an embedded method, every subset of features is modeled by 
a vector a E {0, 1}" of indicator variables, Gi := indicating that a feature is 
present in a subset and Oj := 1 indicating that a feature is absent (i = 1,. . .,n). 
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A parameterized family of classification or regression functions are given as 
follows: / : A x 5ft™ — » 5ft, (a, x) oc f(a, x). The goal of an embedded method is 
to find a vector of indicator variables a* £ {0, 1}™ and a* e A that minimize 
the expected risk R(a,a) = J L[f(a,a * x), y]dF(x, y), where * denotes the 
pointwisc product, L is a loss function and P is a measure on the domain of the 
training data (X;Y). 

One may impose some additional constraints for penalty or regularizations to 
achieve sparseness: s(er) < <r , where s : [0, 1]" — > 5ft + measures the sparsity of a 
given indicator vector a. For example, s could be defined as: s(a) := Iq(cf) < Co, 
that is to bound the zero "norm" lo(a), which counts the number of nonzero 
entries in a. The Ll-norm, L2-norm, and Loo-norm or the elastic-net penalty, 
a mixture of the L2-norm and the Ll-norm penalties [105] are also proposed to 
achieve automatic feature selections by shrinking the fitted coefficients toward 
zero. These automatic feature selection methods also benefit from the reduction 
in the fitted coefficients' variance. 

One of the merits of an embedded method is that it intends to find the 
feature subset of a certain size that leads to the best possible generalization or 
equivalcntly to minimal risk, which can be seen from the above formulation. 
Therefore, the function that measures the quality of a scaling factor can be 
evaluated faster than a cross-validation error estimation procedure. Moreover, 
they turn the multiple testing problems for feature selection into an optimization 
problem in the nonparametric setting. Some recent studies [90; 105] have shown 
that they are more computational efficient and asymptotically optimal for high 
dimensional data. 

Embedded methods tend to have higher capacity than filtering methods and 
are therefore more likely to overfit. We thus expect filtering methods to perform 
better if only a small amount of training data is available. Embedded meth- 
ods eventually outperform filtering methods as the number of training sam- 
ples increase. LASSO proposed by Tibshirani [97; 98], logic regression with the 
regularized Laplacian prior [51] and Baycsian regularized neural network with 
automatic relevance determination [56] are examples of embedded methods. 

Note that the three feature reduction methods, filter, wrapper and embedded 
methods discussed in this section may perform differently when applied to cate- 
gorical SNP data instead of continuous gene expression data, in which there are 
only three genotypes, two homozygous genotypes and one heterozygous geno- 
type. Next we will focus on the review of the recently developed categorical SNP 
data reduction methods in genome wide association studies. 

3. SNP selections in genome-wide association studies 

A major aim of association studies is the identification of polymorphisms, usu- 
ally single nucleotide polymorphisms (SNPs) associated with a trait or disease 
status. There are several major computational and statistical tracks for SNP 
selections, which we will review next [3; 18; 25; 34; 35; 40; 46; 48; 59; 115]. 
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3.1. Statistical measures and testing for SNP-disease association 

Specifically, in genome-wide disease association studies, various statistical mea- 
sures and testing based approaches have been proposed for selecting a sub- 
set of SNPs [17; 30; 36; 57; 85; 89]. These include Linkage Disequilibrium 
(LD) based SNP selection and supervised SNP selection. Linkage Disequilibrium 
based methods for selecting a maximally informative set of SNPs for associa- 
tion analyses were developed first [24; 92; 101; 102; 108]. For instance, Zhang 
and Jin [114] identified tagSNPs from haplotype data in two steps; first, they 
identified haplotype blocks and then identified tagSNPs that best distinguish 
the haplotypes within a haplotype block. This method is applicable for all types 
of association studies. Anderson and Novcmbre [1] and Mannila et al. [61] pro- 
posed finding haplotype block boundaries using minimum description length. 
Entropy-based measure for SNP selections were proposed by Hampe, Schreiber, 
and Krawczak [36] and Zhao, Boerwinkle, and Xiong [117]. Beckmann et al. [8] 
presented Mantel statistics for SNP selections and disease mapping purposes 
by using haplotype sharing to correlate temporal and spatial distributions of 
cancer in a generalized regression model. 

A sliding window approach developed by Neale and Sham [68] combines p- 
values from multiple independent tests using x 2 = — 2 i l°9{Pi) ~ xim- 
Here, pi is the p-value of association between SNPi and presence of disease, 
and Tji is the number of SNPs in the sliding window. The test statistic \ 2 
has a chi-square distribution with 2m degrees of freedom. The sliding window 
incorporates the ordering of SNPs on the chromosome and merges results across 
adjacent windows to detect chromosome regions with significant associations 
[27; 84; 113]. However, it does not consider the distance between them and the 
implicit assumption is that the SNPs are equally spaced. 

The scan statistic [26; 39; 54; 91; 104] does account for the spacing and order- 
ing of SNPs on the chromosome, but it does not consider gene-gene interactions. 
For instance, Sun, et al. [91] developed a chromosomal scan statistic approach, 
which includes two parts: (i) Identifying SNP clusters; (ii) Identifying SNP clus- 
ters with significant disease association. This scan method assumes the position 
of each SNP is randomly determined by a Poisson process. The lengths between 
two adjacent SNPs have an exponential distribution and the sum-of-lengths 
between SNPs has a Gamma distribution. Under the above assumptions, the 
clusters of SNPs are first identified by testing the hypothesis that whether the 
observed length between a set of SNPs (combined interval between these SNPs) 
is equal (null hypothesis) or less than (alternative hypothesis) the expected 
length. Rejection of the null hypothesis identifies this group of SNPs as clus- 
ter. To further identify SNP clusters with significant disease association, disease 
outcomes are incorporated and Pearson Chi-square p-values are computed for 
associations of significance. 

Other test statistic approaches, such as the score statistic [81; 82], and weighted- 
average statistic [87] for disease mapping in case-control studies were also pro- 
posed for SNP selection in genetic association studies. Cheng et al. [20] propose 
using the expectation maximization (EM) algorithm to estimate haplotype fre- 
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quencies of multiple linked SNPs, and follow this by constructing a contingency 
table statistic S for LD analysis based on the estimated haplotype frequencies. 
An empirical p- value is obtained based on the null distribution of the maximum 
of S (S*) from a large number (e.g., 1,000 or more) of randomized permuta- 
tions. This method is developed for mapping functional sites or regions from 
case-control data using haplotypes of multiple linked SNPs. 

All these conventional test based filter approaches estimate the association 
between each SNP (or multiple SNPs) and a phenotype, and then use the cor- 
responding p- values to prioritize the results. One drawback is that one may end 
up with many highly correlated SNPs or genes with high redundancy informa- 
tion, which can be hurdles for further classifications and predictions. Also the 
test based approaches can not incorporate many environmental factors to ac- 
count for gene-environment interactions. Furthermore, the non-independence of 
SNPs in physical proximity (Linkage Disequilibrium) may cause problems for 
multiple testing scenarios with correlated tests [6; 7; 23; 70; 80; 112]. Simple cor- 
rections may lead to either conservative p- values if Bonferroni correction is used 
or become computationally expensive, if permutation is used [84]. Nyholt [70] 
proposed a method for efficiently accounting for multiple testing of many SNPs 
in an association study that involves estimating an "effective number" of inde- 
pendent tests, and then adjusting the smallest observed p-value using Sidak's 
formula based on this number of tests. Salyakina et al. [80] further evaluated 
this method. 

Note that the "multiple testing problem" discussed here differs from the 
"curse of dimensionality problem" , so it poses different challenges. "Multiple 
testing problem" is caused by the high dimensionality of the predictors (includ- 
ing features plus possible interactions of features) and the complex correlation 
structures of the predictors, while the "curse of dimensionality problem" arises 
when considering the interaction of many features, i.e., there are not enough 
observations in each combination of those features. 

Last, but not least, these existing testing based approaches ignore some in- 
formation about the SNPs, such as sub-structures of the underlying population 
(admixture problem). This may lead to spurious results as well as suffer from 
low power. This may explain why reproducibility has become a major issue in 
genomic association studies for complex diseases. The same data set can show 
a highly significant association with one method, whereas a different method 
shows no or only a marginal association. Also, given the low prior probabil- 
ity of causality for each SNP in the genome, rigorous standards of statistical 
significance are needed for genome-wide association studies in order to avoid a 
flood of false-positive results. Multiple replications in large samples may pro- 
vide the most straightforward path in identifying robust and broadly relevant 
associations. 

3.2. Supervised statistical models and statistical learning algorithms 

In order to incorporate environmental factors and other covariates/confoundcrs 
into the genomic association studies, various model based approaches have been 
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developed. He and Zclikovsky [37] proposed tagSNPs for unphased genotypes 
based on multiple linear regressions. Durrant et al. [28] adopted a logistic- 
regression model applicable to whole-genome screens using sliding windows; it 
controls for other (continuous) confoundcrs and gene- environment interactions. 
Yet, they have to make assumptions on the disease model, which is usually un- 
known in practice. Moreover, the effects of violations of these assumptions are 
unpredictable in general. Baker [5] applied a simple loglincar model for haplo- 
type effects in a case-control study involving two unphased genotypes. 

The haplotype trend regression, developed by Zaykin et al. [Ill], fits a model 
of additive effects of haplotypes that takes a series of marker genotypes, com- 
putes haplotype probabilities for each observation using the composite haplotype 
method (CHM), and forms a linear regression on the response using the hap- 
lotype probabilities as the regression matrix. A nonparametric method called 
Haplotype Pattern Mining (HPM) was proposed to identify disease associated 
haplotype patterns from case-control data. HPM has two steps: In step I, given 
the data-markers, haplotypes, and phenotypes, the goal is to output all haplo- 
type patterns that are strongly associated with the disease status for a given 
value of the association threshold; In step II, it is to find the "gene location" , by 
counting frequency that one marker appears in the haplotype patterns identified 
in the first step. Since the HPM method utilizes the disease status (case/control), 
it is a supervised mining approach. Toivonen et al. [99] showed that HPM docs 
not require any assumptions on the inheritance patterns and has good local- 
ization power, even when the number of phenocopies is large. Knorr-Held and 
Rue [49] developed Markov random field models on block updating for disease 
mapping. Other model-based approaches that can take into account the spatial 
correlation between markers were also proposed [14; 31; 32; 42; 96; 100; 103; 106]. 

Recently, Schwender and Ickstadt [83] demonstrated logic regression based 
identification of SNP interactions for the disease status in case-control study and 
proposed two measures for quantifying the importance of feature interactions 
for classification. In comparison with some well-known classification methods 
such as CART [12], Random Forests [11], and other regression procedures [108], 
logic regression has shown a good classification performance when applied to 
SNP data. When fitting with categorical features/ variables in the model based 
approaches, i.e. the genotype measurements with two homozygous genotypes 
and one heterozygous genotype, we often define a set of dummy variables that 
represent a single categorical feature/ variable. In order to select the set/group 
of dummy variables that represent a single categorical feature/variablc/SNP 
simultaneously, Yuan and Lin [110] proposed the group-Lars and the group- 
Lasso methods. Park and Hastie [72] proposed several regularization path algo- 
rithms with grouped feature/variable selection for modeling gene-gene interac- 
tions. Multifactor dimensionality reduction has been proposed and implemented 
for SNPs data reduction by Coffey et al. [22], Ritchie et al. [77] and Moore et 
al. [64]. 
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3.3. Unsupervised haplotype mapping approaches 

Haplotype density based clustering algorithms and clustering techniques based 
on the degree of haplotype sharing in affected individuals for haplotype map- 
ping were developed recently. These approaches have advantages of robustness 
since they are nonparamctric and require fewer assumptions in modeling. Fu 
et al. [29] and Zhang et al. [116] proposed Bayesian models for the analysis 
of genetic structure when populations are correlated. Liu et al. [58] employed 
a Bayesian approach to model positions of the historical recombinations and 
mutation events that produced the observed haplotypes from an initial set of 
founders by accounting for all sources of uncertainties. They employed Monte 
Carlo Markov Chain (MCMC) method for parameter estimation and assigned 
haplotypes to clusters representing allele heterogeneity. Molitor et al. [62] mod- 
eled haplotype risks using clusters obtained from a probability model, but their 
method does not take phenocopies into consideration. Both methods were devel- 
oped mainly for haplotype fine mapping and do not scale up for whole-genome 
screens very well. 

Other algorithms for SNPs are hierarchical clustering and graph methods 
[2; 55]. Principal Component Analysis with multiple genotype frequencies was 
also applied to select a subset of correlated SNPs that capture multiple genotype 
variability in the region [!); 57]. However, whether Principal Component Analysis 
is a suitable tool for categorical SNPs information is arguable, since it is more 
appropriate for continuous scale data. The related correspondence analysis may 
be more suitable, but the interpretation of the results from correspondence 
analysis reveals many challenges. 

3-4- Computational intelligence approaches 

Computational intelligence systems [47; 74] hold a great promise for tackling 
the tasks and challenges posed by large, diverse, genomic data for complex 
diseases. Some of these challenges are the identification of genc-gene and gene- 
environment interactions [4; 43; 50; 66; 78], dealing with the notorious "curse of 
dimensionality" , the uncertainty, and unclear, fuzzy boundaries of phenotypes 
for complex diseases [76; 88]. Techniques include neural networks [71], genetic 
algorithms [21], genetic programming [65], evolutionary trees [53], evolutionary 
algorithms [41] and various hybrid approaches. For instance, Moore [64] de- 
veloped a hybrid genetic programming (GP) with a multifactor dimensionality 
reduction method to pick SNPs for epistasis. 

Motsingcr, et al. [67] applied a genetic programming neural network (GPNN) 
approach for detecting epistasis in case-control studies for SNPs data. They eval- 
uated the power of GPNN for identifying high-order gcnc-gcnc interactions and 
applied GPNN to a real data on Parkinson's disease. They developed a Gram- 
matical Evolution Neural Network (GENN), a machine- learning approach to 
detect gene-gene and gene-environment interactions in high dimensional genetic 
epidemiological data. Furthermore, they proposed an Ensemble Learning Ap- 
proach for Set association (ELAS) to detect a set of interacting loci that predicts 
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the complex trait. An important advantage of the hybrid approach is that any 
form of expert knowledge could be used to guide the stochastic search algorithm 
to identify epistatic SNPs in the absence of marginal effects. 

4. Other challenges in genetic association studies of complex 
diseases 

An important challenge that faces molecular association studies in the post ge- 
nomic era is to understand the interconnections from a network of genes and 
their products that are modified by a variety of environmental factors [15; 45]. 
The variety of phenotype definitions leads to a multiplicity of tests that involve 
a large number of comparisons that often result in less power. The need for ade- 
quate algorithms and models for reducing biological and statistical redundancy 
from thousands of SNPs and finding an optimal set of SNPs associated with dis- 
eases are pressing for common complex diseases. Dealing with many dependent 
association tests is one of the emerging issues on the statistical/computational 
side. 

For SNP-disease data, in addition to being large, redundant, diverse and dis- 
tributed, three important characteristics pose challenges for data analysis and 
modeling: (1) heterogeneity, (2) a constantly evolving biological nature and (3) 
complexity. Firstly, there is the heterogeneity of SNP data, in the sense that i) 
the population data involves the population substructure or admixture problem 
and there is locus heterogeneity where a large fraction of the prevalence is due 
to phenocopies; and ii) there is a wide array of data types, including categor- 
ical, continuous, sequence data, as well as temporal, incomplete and missing 
data. Such data sets are large with a lot of redundancy in SNP and haplo- 
type databases. Secondly, they are very dynamic and continuously evolving, 
which means that special knowledge is required when designing the modeling 
techniques. Lastly, but most importantly, these SNP and haplotype data are 
complex with intrinsic features and subtle patterns, in the sense that they are 
very rich in associated complex phenotype traits. 

The difficulty in a SNPs association study is increased by the nature of com- 
plex diseases [38] . Typically, the contribution of single genes as well as of single 
environmental risk factors is small to moderate. Furthermore, most complex 
diseases result from gene-gene and gene-environment interactions [19]. By dis- 
regarding interactions, relative risks of individual genetic variants are expected 
to be small. Disregarding gene- environment interaction also weakens exposure- 
disease and gene-disease associations. In complex diseases, it is likely that a 
combination of genes predisposes for the disease and environmental factors ag- 
gravate the impact of these genes and therefore are jointly responsible for disease 
development in populations (known as epistatic effect). In addition, environmen- 
tal factors, which seem to have only a moderate impact at the population level 
might have larger relative risks in subpopulations with certain genetic predis- 
positions. There are major methodological challenges in the study of genc-gene 
and gene-environment interactions. 
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Other open questions and challenges for new computational approaches in 
analyzing the associations between genetic markers, such as SNPs in complex 
diseases involve several hierarchical levels. First level of complexity: How to 
analyze multiple SNPs in a single gene? How to analyze interactions among 
multiple SNPs in a single gene? Second level of complexity: How to analyze 
multiple SNPs in multiple genes? How to analyze interactions among multiple 
SNPs in multiple genes? Third level of complexity: How to analyze interactions 
among multiple SNPs in multiple genes and environmental factors? Fourth level 
of complexity: How to analyze associations between SNPs in single or multi- 
ple genes and quantitative traits? How to identify and quantify the percentage 
of the association between genes and diseases explained by the association be- 
tween the same gene and quantitative traits, taking into consideration single 
genes, multiple genes and environmental factors. Lastly, the ultimate goal in ge- 
netic/genomic analysis is to build direct or indirect causal association between 
genetic variants and phenotypes/disease status, but the difficulty here is that 
we do not know if there is association between the SNPs and the disease. How- 
ever, with the development of computational/statistical approaches, we may be 
able to identify these causal associations and construct the pathways related to 
complex diseases. 

5. Discussion 

New advances in human genome research have drawn tremendous attention of 
researchers from multiple fields, including both theoretical scientists and ap- 
plied researchers, especially in the statistical field. Huge amounts of contin- 
uously growing large-scale genomic, proteomic and clinical data for complex 
diseases and phenotype traits have posed ever greater challenges for the compu- 
tational field. Multiple whole genome wide association studies have already been 
completed and have resulted in novel and promising genetic variants for vari- 
ous diseases. In this paper we presented a survey of recent advances and some 
promises of designing, developing and implementing statistical/computational 
methods for identifying SNP markers responsible for common, complex, chronic 
diseases, such as diabetes, cancer, multiple sclerosis, and cardiovascular dis- 
ease and for tackling the challenges, such as gene-gene and gene-environment 
interactions along with the notorious "curse of dimensionality" problem. Suc- 
cess in identifying SNPs and haplotypes conferring susceptibility or resistance 
to common diseases will provide a deeper understanding of the architecture of 
the disease, the risks, and offer a more powerful diagnostic tool and predictive 
treatment. 
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